Grasshoppermouse: Put your data in an R package

Ed Hagen
https://anthro.vancouver.wsu.edu/people/hagen
10-18-2017

I used to write long R scripts that imported data files, created new variables, reshaped the data, reshaped it again, and spit out results along the way, all in one file.

That worked so long as I never wanted to use that data again. But what if I did? Should I just tack on more code for the new analysis, and then even more code for yet more analyses? That approach litters the environment with objects that are irrelevant to, and might even interfere with, a particular analysis. Should I copy all the files into a new directory and then hack away at the code? Now I have two copies of the data, so which one is definitive? Should I just treat the original data files as the data? Now I have to repeat the same initial processing steps every time I want to reuse that data.

The solution

For each new data set I create a new R data package. This package lives in my library along with ggplot2, dplyr, lme4, and all my other packages, and is accessible in any project or analysis with a simple:

library(mydatapackage)
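To get a feel for what this looks like in practice, here is a minimal sketch that uses `datasets`, the data package that ships with R, as a stand-in for `mydatapackage` (which is only a placeholder name). A data package of your own should behave much the same way once it is installed.

```r
# `datasets` ships with R; it stands in here for a hypothetical `mydatapackage`.
library(datasets)

# List the data sets the package provides
available <- data(package = "datasets")$results[, "Item"]
head(available)

# A data frame from the attached package is available by name...
head(mtcars)

# ...or, without attaching the package, via the `::` operator
str(datasets::mtcars)

# Open the documentation for the data frame (the equivalent of ?mtcars)
if (interactive()) help("mtcars")
```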

Creating a data package involves some small costs, but these are far outweighed by the benefits.

Pros

- Your data is cleanly separated from your analyses.
- Your data is easily accessible in any future project.
- Packages have a built-in documentation system so you can easily document all your variables. The documentation for each data frame (or other object) is accessible with ?my_df.
- Packages have a versioning system so you can keep track of new versions of your data package.
- Share your data with students and colleagues simply by sharing the package.
- Archive your data in a public repository simply by uploading the package.

Cons

- Creating a package is a few extra steps.
- Every change to the data package requires a rebuild step before the changes are available in your analyses. If you forget to rebuild, your analyses will be using the outdated version of your data, something that can be hard to detect.
- In the early phases of an analysis, you will probably be moving code back and forth from your data package to your analysis until you find the sweet spot between processed data and analyzed data.

How to create a data package

I assume you are using RStudio. Although RStudio can create packages using the GUI, I have gotten obscure errors using that feature. Therefore, do not create a new RStudio project using the GUI. Instead, run this code from the console:

# Run these from the R console

# Check that the `usethis` package is installed. If not:
install.packages("usethis")

# Create new package. Directory must not exist.
# This also creates a new RStudio project.
usethis::create_package("path/to/my/data/package/")
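The "if not" in the first comment can also be expressed in code. The sketch below installs `usethis` only when it is missing. `pkg_path` is a variable name introduced here for illustration, and the path is the same placeholder as above. Everything that changes your system is wrapped in `interactive()`, so nothing happens if the snippet is sourced by accident.

```r
# Placeholder path, as in the example above
pkg_path <- "path/to/my/data/package/"

if (interactive()) {
  # Install usethis only if it is not already available
  if (!requireNamespace("usethis", quietly = TRUE)) {
    install.packages("usethis")
  }

  # Following the comment above, stop early if the directory already exists
  if (dir.exists(pkg_path)) {
    stop("Directory already exists: ", pkg_path)
  }

  usethis::create_package(pkg_path)
}
```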

Open your new data package using “Open project…” in RStudio. Then run this code from the console:

# Run this code after opening the new package in RStudio

# Set up the data-raw directory and data processing script
# You can use any name you want for your data
usethis::use_data_raw(name = 'mydataset')

# This script in the R directory will contain the documentation.
# You can use any name you want.
file.create("R/data.R")

# Initialize git repository (optional)
usethis::use_git()
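As a rough idea of what might eventually go into `R/data.R`, here is a sketch that assumes the documentation is written as roxygen2 comments, a common convention for R packages. The title, description, and variable names (`id`, `height_cm`, `weight_kg`) are all hypothetical. With this convention, the help pages are typically generated by running `devtools::document()`.

```r
#' My data set (hypothetical title)
#'
#' A short description of where the data came from and what each row
#' represents.
#'
#' @format A data frame with one row per participant and these variables:
#' \describe{
#'   \item{id}{Participant identifier}
#'   \item{height_cm}{Height in centimeters}
#'   \item{weight_kg}{Weight in kilograms}
#' }
#' @source Describe the study or file the data came from.
"mydataset"
```

The quoted name on the last line tells roxygen2 which data object the comment block documents.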

Put your data files into the data-raw folder. Your new package directory should look something like this:

[Figure: Package layout]

Write your data processing code in a data-raw/mydataset.R script. It would look something like this:

# data-raw/mydataset.R
# Data import and processing pipeline

library(readr)
library(readxl)

mydataset
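As an illustration of what such a pipeline might contain (this is not the author's script), here is a minimal sketch of the import, process, and save steps. It swaps in base R's `read.csv` for `readr` so that it runs without extra packages, and it works in a temporary directory that mimics the package layout. The file name `survey.csv` and all column names are made up for the example.

```r
# Sketch of a data-raw/mydataset.R pipeline using base R only.
# All file and column names are hypothetical.

# Work in a temporary directory that mimics the package layout
pkg_dir <- file.path(tempdir(), "mydatapackage")
dir.create(file.path(pkg_dir, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "data"), showWarnings = FALSE)

# Stand-in for a raw data file you would copy into data-raw/
raw_file <- file.path(pkg_dir, "data-raw", "survey.csv")
writeLines(
  c("id,height_cm,weight_kg", "1,170,65", "2,182,80", "3,158,NA"),
  raw_file
)

# 1. Import
mydataset <- read.csv(raw_file)

# 2. Process: derive a new variable
mydataset$bmi <- mydataset$weight_kg / (mydataset$height_cm / 100)^2

# 3. Save the processed object where the package expects it.
# Inside a real package, usethis::use_data(mydataset, overwrite = TRUE)
# handles this step.
save(mydataset, file = file.path(pkg_dir, "data", "mydataset.rda"))
```

In a real package the script would read the files already sitting in `data-raw/`, and the saved object only becomes available through `library(mydatapackage)` after the package is rebuilt.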
